# Multimodal large model

Heron NVILA Lite 33B

Heron-NVILA-Lite-33B is a vision-language model based on the NVILA-Lite architecture, specifically trained for Japanese, and supports multimodal tasks in both Japanese and English.

Image-to-Text, Supports Multiple Languages

InternVL3 2B HF

InternVL3-2B is a multimodal large language model packaged for the Hugging Face Transformers library. It handles image, video, and text inputs, accepts several input formats, and supports batch inference.

Transformers, Other

Qari OCR 0.3 SNAPSHOT VL 2B Instruct Merged

A vision-language model designed specifically for Arabic optical character recognition (OCR), capable of directly recognizing Arabic text in images.

InternLM XComposer2.5 OL 7B

InternLM-XComposer2.5-OL is a comprehensive multimodal system supporting long-term streaming video and audio interaction.

XGen MM Phi3 Mini Base R V1

XGen-MM is a series of large multimodal models developed by Salesforce AI Research. It builds on the design of the BLIP series and makes fundamental enhancements to the model architecture.

Transformers, English

InternLM XComposer2 VL 1.8B

A large vision-language model based on InternLM2, with strong image-text understanding and generation capabilities.

© 2025 AIbase